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Abstract 

This paper addresses signal denoising when large-amplitude coefficients form clusters (groups). The 
Ll-norm and other separable sparsity models do not capture the tendency of coefficients to cluster (group 
sparsity). This work develops an algorithm, called 'overlapping group shrinkage' (OGS), based on the min- 
imization of a convex cost function involving a group-sparsity promoting penalty function. The groups are 
fully overlapping so the denoising method is translation-invariant and blocking artifacts are avoided. Based 
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on the principle of majorization-minimization (MM), we derive a simple iterative minimization algorithm 
that reduces the cost function monotonically. A procedure for setting the regularization parameter, based 

o 

speech enhancement, wherein the OGS approach is applied in the short-time Fourier transform (STFT) 



on attenuating the noise to a specified level, is also described. The proposed approach is illustrated on 



!LJ domain. The denoised speech produced by OGS does not suffer from musical noise. 

X 

1 Introduction 

In recent years, many algorithms based on sparsity have been developed for signal denoising, deconvo- 
lution, restoration, and reconstruction, etc. [23]. These algorithms often utilize nonlinear scalar shrink- 
age/thresholding functions of various forms which have been devised so as to obtain sparse representations. 
Examples of such functions are the hard and soft thresholding functions [22], and the nonncgative gar- 
rote [29,30]. Numerous other scalar shrinkage/thresholding functions have been derived as MAP or MMSE 
estimators using various probability models, e.g. [25,37,46]. 
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We address the problem of dcnoising, i.e. estimating x(i), i € I, from noisy observations y(i), 

y(i) = x(i) + w(i), iel, (1) 

where the signal x(i) is known to have a group sparse property and w(i) is white Gaussian noise. Here, I 
denotes the domain of x. typically X = {0. . . . , N — 1} for one-dimensional finite-length signals. A generally 
effective approach for deriving shrinkage/thresholding functions is to formulate the optimization problem 

x* = argmin |f(x) = \ ||y - x||| + Ai?(x) J (2) 

where x = (#», % G I) is the signal to be determined from the observation y = (j/j, i € X). In problem (2), 
x may represent either the coefficients of a signal (e.g. wavelet or short-time Fourier transform coefficients) 
or the signal itself, if the signal is well modeled as sparse. The penalty function -R(x) (rcgularizcr) should 
be chosen to promote the known behavior of x. Many of the shrinkage/thresholding functions devised in the 
literature can be derived as solutions to (2), where i?(x) is specifically of the separable form 



i?(x)=]Tr(x(z)). (3) 
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For example, soft-thresholding is derived as the solution to (2) where r(x) — \x\, which corresponds to the 
lasso problem [70] or the basis pursuit denoising problem [11]. When -R(x) has the form (3), the variables 
x(i) in (2) are decoupled, and the optimization problem (2) is equivalent to a set of scalar optimization 
problems. Therefore, the separable form (3) significantly simplifies the task of solving (2), because in this 
case the optimal solution is obtained by applying a scalar shrinkage/thresholding function independently to 
each element x(i) of x. From a Bayesian viewpoint, the form (3) models the elements x(i) as being statistically 
independent. 

For many natural (physically arising) signals, the variables (signal/coefficients) x are not only sparse but 
also exhibit a clustering or grouping property. For example, wavelet coefficients generally have inter and 
intra-scale clustering tendencies [43,65,69]. Likewise, the clustering/grouping property is also apparent in a 
typical speech spectrogram. In both cases, significant (large-amplitude) values of x tend not to be isolated. 

This paper develops a simple translation-invariant shrinkage/thresholding algorithm that exploits the 
grouping/clustering properties of the signal/coefficient vector x. The algorithm acts on x as a whole without 
performing block-by-block processing, and minimizes the cost function (2) with the (non-separable) penalty 
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#(x) = E 
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(4) 



where the set J7 defines the group. The algorithm is derived using the majorization-minimization (MM) 
method [28,53]. The algorithm can be considered an extension of the successive substitution algorithm for 
multivariate thresholding [63] or as an extension of the FOCUSS algorithm [59]. 

The penalty function (4) has been considered previously, but existing algorithms for its minimization 
[2,3,20,27,39,40,54] are based on variable duplication (variable splitting, latent/auxilliary variables, etc.). 
The number of additional parameters is proportional to the overlap; hence their implementations require 
additional memory and accompanying data indexing. The iterative algorithm proposed here avoids variable 
duplication and can be efficiently implemented via separable convolutions. 

For the purpose of denoising, this paper also develops a conceptually simple method to set the regu- 
larization parameter A analogous to the '3cr' rule. The method allows for A to be selected so as to ensure 
that the noise variance is reduced to a specified fraction of its original value. This method does not aim 
to minimize the mean square error or any other measure involving the signal to be estimated, and is thus 
non-Bayesian. Although conceptually simple, the method for setting A is analytically intractable due to the 
lack of explicit functional form of the estimator. However, with appropriate pre-computation, the method 
can be implemented by table look-up. 

In Section 2, the cost function is given and the iterative successive substitution algorithm for its minimiza- 
tion is presented. In Section 3, the effect of the shrinkage/thresholding algorithm on white Gaussian noise 
is investigated and is used to devise a simple method for selecting the regularization parameter A. Section 4 
illustrates the proposed approach to signal denoising, including speech enhancement. 

1.1 Related work 

The penalty function (4) can be considered a type of mixed norm. Mixed norms and their variants can be 
used to describe non-overlapping group sparsity [24,41,42,74] or overlapping group sparsity [2,3,12,20,27,38, 
39,51,54,73]. Algorithms derived using variable splitting and the alternating direction method of multipliers 
(ADMM) are described in [7,20,27]. These algorithms duplicate each variable for each group of which the 
variable is a member. The algorithms have guaranteed convergence, although additional memory is required 
due to the variable duplication (variable splitting) . A theoretical study regarding recovery of group support 
is given in [38] which uses an algorithm that is also based on variable duplication. A more computationally 
efficient version of [38] for large data sets is described in [51] which is based on identifying active groups (non- 
zero variables). The identification of non-zero groups is also performed in [73] using an iterative process to 



reduce the problem size; a dual formulation is then derived which also involves auxiliary variables. Auxiliary 
and latent variables are also used in the algorithms described in [2,3,40,54], which, like variable splitting, 
calls for additional memory proportional to the extent of the overlap. Overlapping groups arc used to induce 
sparsity patterns in the wavelet domain in [61] also employing variable duplication. 

Several other algorithmic approaches and models have been devised to take into account clustering or 
grouping behavior, such as: Markov models [10,18,21,26,31,50,55,71], Gaussian scale mixtures [33,56], 
neighborhood-based methods with locally adaptive shrinkage [8,49,66], and multivariate shrinkage/thresholding 
functions [1,13,57,58,62-64]. These algorithms depart somewhat from the variational approach wherein a cost 
function of the form (2) is minimized. For example, in many neighborhood-based and multivariate thresh- 
olding methods, local statistics are computed for a neighborhood/block around each coefficient, and then 
these statistics arc used to estimate the coefficients in the neighborhood (or the center coefficient). In some 
methods, the coefficients are segmented into non-overlapping blocks and each block is estimated as a whole; 
however, in this case the processing is not translation-invariant and some coefficient clusters may straddle 
multiple blocks. 

It should be noted that an alternative penalty function for promoting group sparsity is proposed in [38,52] 
(see equation (3) of [52]). Interestingly, it is designed to promote sparse solutions for which the significant 
values tend to be comprised of the unions of groups; while, as discussed in [38,52], the penalty function (4) 
promotes sparse solutions for which the significant values tend to be comprised of the complement of the 
unions of groups. The focus of this manuscript is an efficient algorithm for the minimization of (2) with 
penalty function (4) and the corresponding selection of regularization parameter A. However, the extension 
of this work to the penalty function of [38, 52] is also of interest. 

2 Overlapping group shrinkage/thresholding 

2.1 Motivation 

As shown in [18,43,69], neighboring coefficients in a wavelet transform have statistical dependencies even 
when they are uncorrclated. In particular, a wavelet coefficient is more likely to be large in amplitude if 
the adjacent coefficients (in scale or spatially) are large. This behavior can be modeled using suitable non- 
Gaussian multivariate probability density functions, perhaps the simplest one being 

p(x) = ^cxp(-^±I||x|| 2 Y xel J (5) 



as utilized in [63]. If the coefficients x are observed in additive white Gaussian noise, y = x + w, then the 
MAP estimator of x is obtained by solving (2) with i?(x) = ||x||2, the solution of which is given by 

x=fl-A N ) y , (6) 

V lly||/ + 

where (x) + := max(i, 0). The function (6) can be considered a multivariate form of soft thresholding with 
threshold A. 

The multivariate model and related models are convenient for estimating small blocks/neighborhoods 
within a large array of coefficients; however, when each coefficient is a member of multiple groups, then cither 
estimated coefficients are discarded (e.g. only the center of each block of estimated coefficients is retained as 
in sliding window processing) or the blocks are sub-sampled so that they are not overlapping. In the first case, 
the result does not correspond directly to the minimization of a cost function; whereas, in the second case, 
the process is not translation-invariant and issues may arise due to blocks not aligning with the group sparsity 
structure within the array. Here we are interested in a variational approach based on fully-overlapping groups 
so that the derived algorithm is translation-invariant and blocking artifacts are avoided. The penalty function 
(4) is clearly translation-invariant. 

2.2 Penalty function 

Assuming that x has a group sparsity (clustering) property, to perform translation-invariant denoising, we 
add the £2 norm for each group to obtain the penalty function (4) . The penalty function (4) is convex and the 
cost function F in (2) is strictly convex (because ||y — x||| is strictly convex in x). The index i is the group 
index, and j is the coefficient index within group i. Each group has the same size, namely \J\. In practice, 
x is of finite size; hence, the sum over % in (4) is a finite sum. To deal with the boundaries of x, we take x(i) 
in (4) as zero when i falls outside X; that is, x(i) = for i ^ I. 

For one-dimensional vectors x of length TV with group size K, we set I in (I) to 

1={0, ...,N-1}, (7) 

and J in (4) to 

J = {0,...,K-1}. (8) 

Note that in (4), the groups are fully overlapping, as per a sliding window shifted a single sample at a time. 
It is possible to include a weighting w(j) in (4) as in Ref. [67,68]. 



For a two-dimensional array x of size N± x N2 with group size K\ x K 2, we set X in (1) to 



and J in (4) to 



Z={(*i,t 2 ) : 0<»i< JVi-1, 0<* 2 <JV 2 -1}, 



J = {OW2) : < jt < K, - 1, < j 2 < K 2 - 1}. 



In the two-dimensional case, i+j denotes (i\ + ji, «2 +.72)- The same notation extends to higher dimensions. 

Note that minimizing (2) with (4) can only shrink the data vector y toward zero. That is, the minimizer x 
of (2) will lie point-wise between zero and y, i.e. x(i) € [0,y(i)] for all j £l. This can be shown by observing 
that the penalty function in (4) is a strictly increasing function of \x(i)\ and is independent of the sign of each 
x(i). As a result, if y(i) = for some i S I, then x*(i) = also, where x* is the minimizer of F. 

An important point regarding R in (4) is that it is non-diffcrentiable. In particular, if any group of x is 
equal to zero, then R is non-diffcrentiable at x. 

2.3 Algorithm 

To minimize the cost function -F(x) in (2), we use the majorization-minimization (MM) method [28]. To this 
end, we use the notation 

Vi, K = [v(i), ...,v(i + K-l)YeC K (9) 

to denote the i-th group of vector v. Then the penalty function in (4) can be written as 



i?(x) = j2\\*i, 



K\\2- 



(10) 



To majorize -R(x), first note that 
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for all v and u^O, with equality when v = u. The inequality (11) can be derived from t 2 + s 2 > 2ts, Vi, s e 
by setting t = ||"v|| 2 , s = ||u|| 2 - Define 
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If \\ui.K lb > for all i £ I, then it follows from (10) and (11) that 

,g(x,u)>i?(x) (13) 

.g(u,u)=i?(u), (14) 

for all x, u. Therefore, g(x, u) is a majorizer of i?(x) provided that u has no groups equal to zero. Moreover, 
the elements of x are decoupled in <?(x, u) and so g(x, u) can be written as 

5(x,u) = -^r(i;u)|a;(i)| 2 + c (15) 
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(16) 
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and c is a constant that does not depend on x. A majorizer of the cost function -F(x) is now given by 

G(x,u) = i||y-x||2 + A.9(x,u). (17) 

The MM method produces the sequence x^ k \ k > 1, given by: 

x (fc+1) = argmin G(x,x (fc) ) (18) 

X 

= argmin ||y - x||| + AV r(i; x'*') |x(i)| 2 (19) 

x -<- — ^ 

where x^ ' is the initialization. Note that (19) is separable in x(i), so we can write (19) equivalently as 

X {k+1 \i) = argmin (y(i) - xf + \r(i; x' 1 ') |a;| 2 . (20) 

However, the term r(i;x^) is undefined if x^ = 0, i.e. if the i-th group is all zero. This is a manifestation 
of the singularity issue which may arise whenever a quadratic majorizer is used for a non-diffcrcntiablc 
function [28]. Hence, care must be taken to define an algorithm that avoids operations involving undefined 
quantities. 

Consider the following algorithm. Define I' as the subset of X where x^°\i) 7^ 0, 

X' :={i£l : i(°)(i)^0}. (21) 



Define the update equation by: 

y(i) 
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c (W)(i)=jl + Ar(»;xW)' ^ (22) 

i <£ T 

with initialization x( ) = y. Note that the first case of (22) is the solution to (20). The iteration (22) is the 
'overlapping group shrinkage' (OGS) algorithm, summarized in Table 1 in the form of pseudo-code. 

Several observations can be made. 

It is clear from (21) and (22), that 

a;(°)(»)=0 =► x^(i)=0, for all A: > 1. (23) 

Therefore, any values initialized to zero will equal zero throughout the course of the algorithm. However, if 
x iP) = 0, then y(i) = as per the initialization. Note that if y(i) = 0, then the optimal solution x* has 
x* (i) = 0. Therefore 

x (o) (i)=0 => x (k) (i) =x*(i), for allfc> 1. (24) 

As a side note, if y is a noisy data vector, then it is unlikely that y(i) = for any i £ X. 

Now consider the case where x^°\i) ^ 0. Note that if x^ k '(i) ^ 0, then no groups of x( fe ) containing 
x^(i) arc all zero. Hence r(i; x^) is well defined with r(i; x^ fc ') > 0. Therefore y(i)/[l + Xr(i; x' fe ')] is well 
defined, lies strictly between zero and y(i), and has the same sign as y(i). Hence, by induction, 

x (o) (i)^0 => x {k \i) t^O, for all fe > 1. (25) 

Therefore, any value not initialized to zero will never equal zero in any subsequent iteration. However, for 
some % £ I', x^ '(i) may approach zero in the limit as k — > oo. That is, 

x^(i)^0 =£■ lim x {k \i) t^O. (26) 

k— >oo 

In the example in Sect. 4.1, it will be seen that some values do go to zero as the algorithm progresses. In 
practice, (25) may fail for some i € I' due to finite numerical precision. In this case, such i should be removed 
from I' to avoid potential 'division-by-zero' in subsequent iterations. 

The OGS algorithm produces sparse solutions by gradually reducing non-zero values of y toward zero, 
rather than by thresholding them directly to zero on any iteration, as illustrated in Fig. 4 below. 

The output of the OGS algorithm will be denoted asx = ogs(y; X,K), where K is the block size. The 



Table 1: Overlapping group shrinkage (OGS) algorithm for minimizing (2) with penalty function (4). 

Algorithm OGS 

input: y £ R N , A > 0, J 

x = y (initialization) 

1' = {i£ 2, x(i) + 0} 

repeat 



'(*! x ) = Yl 



.kej 
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, iel' 



x(i) = - — v } 1 }. . , i el' 

w 1 + Ar(i;x)' 
until convergence 
return: x 



OGS algorithm can also be applied to multidimensional data y using the above multi-index notation, with 
group size K = {K\, . . . , Kd) and where I and J are multi- indexed sets as described above. 

Note that when the group size is K = 1, then J in (8) is given by J = {0}, and the penalty function (4) 
is simply the i\ norm of x. In this case, the solution is obtained by soft thresholding. When the group size 
K is greater than one, the groups overlap and every element of the solution x depends on every element of y; 
hence, it is not possible to display a multivariate shrinkage function as in [63] for the non-overlapping case. 

Implementation. The quantity r(i; x) in step (2) of the OGS algorithm can be computed efficiently using 
two convolutions — one for the inner sum and one for the outer sum. The inner sum can be computed as the 
convolution of |a;(-)| 2 with a 'boxcar' of size \S\- Denoting by g(-) the result of the inner sum, the outer sum 
is again a running sum or 'boxcar' convolution applied to g{)~ 1 ' 2 ■ In the multidimensional case, each of the 
two convolutions are multidimensional but separable and hence computationally efficient. 

The computational complexity of each iteration of the OGS algorithm is of order KN , where K is the 
group size and N is the total size of y. The memory required for the algorithm is 2N + K . 

Convergence. For the OGS algorithm, due to its derivation using majorization- minimization (MM), it 
is guaranteed that the cost function F(x) monotonically decreases from one iteration to the next, i.e., 
F(x.( k+1 ') < F(x( k ') if x^ ' y£ x*. Yet, the proof of its convergence to the unique minimizer is complicated 
by the 'singularity issue' which arises when a quadratic function is used as a majorizer of a non-diffcrcntiablc 
function [28]. For the OGS problem it is particularly relevant since, as in most sparsity-based denoising prob- 
lems, .F(x) will usually be non-differcntiable at the minimizer x*. Specifically, for OGS, if x* contains any 
group that is all zero, then -F(x) is non-diffcrentiable at x*. However, several results regarding the singularity 



issue are given in [28] which strongly suggest that this issue need not hinder the reliable convergence of such 
algorithms in practice. 

For the OGS algorithm with the initialization x' ' = y, the component x^ fc '(i) will never equal zero except 
when y(i) itself equals zero, as noted above. In the OGS algorithm, the singularity issue is manifested by 
r(i, x' ') approaching infinity for some i G I'. In particular, if x*(i) = for some i G I' , then r(i,x*) is 
undefined (due to 'division- by-zero'). For i e I' such that x*(i) — 0, the value of r(i,x( fc )) goes to infinity as 
the algorithm progresses, and x^ k \i) goes to zero. Note that in the OGS algorithm, the term r(i,x^) is used 
only in the expression y(i)/[l + Xr(i; x' fe ')]. Therefore, even when r(i; x\ ') is very large, this expression is 
still numerically well behaved. (When large values of opposite signs are added to obtain small numbers, the 
result is not numerically reliable, but that is not the case here.) If the algorithm is implemented in fixed point 
arithmetic, then it is indeed likely that the large values of r(i; x^) will lead to overflow for some i € T ' . In 
this case, x^ k '(i) should be set to zero and i should be removed from I' for the subsequent iterations. 

We do not prove the convergence of the OGS algorithm due to the singularity issue. However, in practice, 
the singularity issue does not appear to impede the convergence of the algorithm, consistent with the results 
of [28]. We have found that the empirical asymptotic convergence behavior compares favorably with other 
algorithms that have proven convergence, as illustrated in Fig. 5. 

One approach to avoid the singularity issue is to use the differentiable penalty function 
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i? e (x)=^ (£>(*+./ 
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(27) 



where e is a small positive value, instead of (4). However, as in [53], we have found it unnecessary to do so, 
since we have found that the algorithm is not hindered by the singularity issue in practice. For the regularizcr 
(27), convergence of the corresponding form of OGS can be proven using the Global Convergence Theorem 

(GST) [45,60]. 

Proximal operator. An effective approach for solving a wide variety of inverse problems is given by the 
proximal framework [4,15,16]. In this approach, the solution x to a general inverse problem (e.g. deconvolution, 
interpolation, reconstruction) with penalty function -R(x) can be computed by solving a sequence of denoising 
problems each with penalty function i?(x). Therefore, the efficient computation of the solution to the denoising 
problem is important for the use of proximal methods. In this context, the denoising problem, i.e. (2), is 
termed the proximal operator (or proximity operator). The OGS algorithm is therefore an implementation of 
the proximal operator for penalty function (4), and can be used in proximal algorithms for inverse problems 
that are more general than denoising. As noted above, other implementations of the proximal operator 
associated with overlapping group sparsity have been given [2,3,7,12,20,27,38,40,54,73]; these algorithms 
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are based on the duplication of variables, and hence require more memory (proportional to the group size in 
the fully-overlapping case). 

FOCUSS. The OGS algorithm can be considered as a type of FOCUSS algorithm [34] that is designed to 
yield sparse solutions to under-determined linear systems of equations. It was extended to more general 
penalty functions (or diversity measures) [60] and to the noisy-data case [59]. Setting p = 1 and A = I in 
equation (13) of [59] gives the OGS algorithm for group size K = 1, namely an iterative implementation of 
the soft-threshold rule. (Note, the emphasis of FOCUSS is largely on the non-convex (p < 1) case with i^I 
and only group size K = 1). 

The FOCUSS algorithm was further extended to the case of multiple measurement vectors (MMV) that 
share a common sparsity pattern [17]. We note that the resulting algorithm, M-FOCUSS, is different than 
OGS. In the MMV model, the location (indices) of significant coefficients is consistent among a set of vectors; 
while in the OGS problem there is no common (repeated) sparsity pattern to be exploited. 

The M-FOCUSS algorithm was later improved to account for gradual changes in the sparsity pattern in a 
sequence of measurement vectors [75]. This approach involves, in part, arranging the sequence of measurement 
vectors into overlapping blocks. While both, the algorithm of [75] and the OGS algorithm, utilize overlapping 
blocks, the former algorithm utilizes overlapping blocks to exploit a sparsity pattern (approximately) common 
to multiple measurement vectors, whereas the latter does not assume any common sparsity among blocks. 

The FOCUSS algorithm was extended to mixed norms in [41] to attain structured sparsity without over- 
lapping groups. This approach is extended in [42, 67, 68] where sliding windows are used to account for 
overlapping groups. However, as noted in [42], this approach does not directly correspond to an optimization 
problem; hence it does not constitute an implementation of the proximal operator. 

3 Gaussian noise and OGS 

This section addresses the problem of how to set the regularization parameter A in (2) in a simple and direct 
way analogous to the '3<7 rule'. The use of the 3ct rule for soft thresholding, as illustrated in Sec. 4.1, is simple 
to apply because soft thresholding has a simple explicit form. However, overlapping group shrinkage has no 
explicit form. Therefore, extending the notion of the '3<r rule' to OGS is not straight forward. The question 
addressed in the following is: how should A be chosen so that essentially all the noise is eliminated? In 
principle, the minimum such A should be used, because a larger value will only cause further signal distortion. 
In order to set A so as to reduce Gaussian noise to a desired level, the effect of the OGS algorithm on pure 
i.i.d. zero- mean unit-variance Gaussian noise is investigated. We examine first the case of group size K = 1, 
because analytic formulas can be obtained in this case (OGS being soft thresholding for K = 1). 
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Figure 1: Standard deviation of standard Gaussian noise AT(0, 1) after (a) soft thresholding and (b) overlapping 
group shrinkage (OGS) with group size 3x3. 

Let y ~ A/"(0, 1) and define x = soft(y, T). Then the variance of x as a function of threshold T is given by 



<jI(T) = E[x 2 ]= f 



\v\>t 



2(l + T 2 )Q(T)-TW-exp 



(\y\-T) 2 p y (y)dy 

' 2 



(28) 
(29) 



where p y (y) is the standard normal pdf A/"(0, 1) and 



« (T,: =7b/J v ' f, " =0 - 5 ( 1 - ert (^ 



The standard deviation a x (T) is illustrated in Fig. la as a function of threshold T. Since the variance of x 
is unity here, the 3a rule suggests setting the threshold to T = 3 which leads to a x (3) — 0.020 according to 
(29). 

The graph in Fig. la generalizes the 3a rule: Given a specified output standard deviation a x , the graph 
shows how to set the threshold T in the soft threshold function so as to achieve it, i.e., so that E[soft(y, T) 2 ] = 
a x where y ~ jV(0, 1). For example, to reduce the noise standard deviation a to one percent of its value, we 
solve <7 X (T) = 0.01 for T to obtain T — 3.36cr, a threshold greater than that suggested by the 3a rule. 

To set the regularization parameter A in the OGS algorithm, we suggest that the same procedure can 
be followed. However, for OGS there is no explicit formula such as (29) relating A to a x . Indeed, in the 
overlapping group case, neither is it possible to reduce E[x 2 ] to a univariate integral as in (28) due to the 
coupling among the components of y, nor is there an explicit formula for x in terms of y, but only a numerical 
algorithm. 

Although no explicit analog of (29) is available for OGS, the functional relationship can be found numeri- 
cally. Let y be i.i.d. 7V(0, 1) and define x as the output of the OGS algorithm: x = ogs(y; A, K). The output 
standard deviation a x can be found by simulation as a function of A for a fixed group size. For example, 
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consider applying the OGS algorithm to a two-dimensional array y using a group size of 3 x 3. For this group 
size, a x as a function of A is illustrated in Fig. lb. The graph is obtained by generating a large two-dimensional 
array of i.i.d. standard normal random variables, applying the OGS algorithm for a discrete set of A, and 
then computing the standard deviation of the result for each A. Once this graph is numerically obtained, 
it provides a straight forward way to set A so as to reduce the noise to a specified level. For example, to 
reduce the noise standard deviation a down to one percent of its value, we should use A rs 0.43ct in the OGS 
algorithm according to the graph in Fig. lb. (Obtaining the value A corresponding to a specified a x generally 
requires an extrapolation between the data points comprising a graph such as Fig. lb.) Note that the graph 
will depend on the group size, so for a different group size the graph (or table) needs to be recalculated. For 
efficiency, these calculations can be performed off-line for a variety of group sizes and stored for later use. 

Table 2 gives the value of the regularization parameter A so that OGS produces an output signal y with 
specified standard deviation a x when the input y is standard normal (zero-mean unit-variance i.i.d. Gaussian). 
The table applies to both ID and 2D signals. The table provides values for groups up to length 5 in ID, and 
up to size 5 x 5 in 2D. Note for groups of size 1 x K, the value of A is the same for ID and 2D. Also, note 
that A is the same for groups of size K\ x K^ and K2 x K\\ so each is listed once in the table. The first 
value in each column is obtained using 150 iterations of the OGS algorithm (more than sufficient for accurate 
convergence); while the second value in parenthesis is obtained using 25 iterations (sufficient in practice and 
uses less computation). For group size lxl, OGS is simply soft thresholding; hence no iterations are needed 
for it. The values A listed in Table 2 are accurate to within 0.01 and were computed via simulation. 

To clarify how Table 2 is intended to be used, suppose one is using the OGS algorithm with 2x3 groups for 
denoising a signal contaminated by additive white Gaussian noise with standard deviation a. To reduce the 
noise down to 0.1% of its value, one would use A = 0.74ct if running the OGS algorithm to full convergence; 
or one would use A = 0.77a if only 25 iterations of the OGS algorithm are used. The values are from the 
10~ 3 column in Table 2. 

It can be observed in Fig. 1 that the function a x (-) has a sharper 'knee' in the case of OGS compared with 
soft thresholding. We have examined graphs for numerous group sizes and found that in general the larger 
the group, the sharper is the knee. Note that in practice A should be chosen large enough to reduce the noise 
to a sufficiently negligible level, yet no larger so as to avoid unnecessary signal distortion. That is, suitable 
values of A are somewhat near the knee. Therefore, due to the sharper knee, the denoising process is more 
sensitive to A for larger group sizes; hence, the choice of A is more critical. 

Similarly, it can be observed in Fig. 1 that for OGS, the function a x (-) follows a linear approximation 
more closely to the left of the 'knee' than it does in the case of soft thresholding. We have found that near 
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Table 2: 


Paramctc 


r A for standard normal i 


i.d. 


signal 


Output std a x 


Group 10 


-2 


10~ 3 10~ 4 




10~ 5 



1 x 1 


3.36 


4.38 


5.24 


6.00 


1x2 


1.69(1.73) 


2.15(2.24) 


2.38(2.61) 


2.46(2.94) 


1x3 


1.16(1.18) 


1.46(1.52) 


1.60(1.77) 


1.64(1.99) 


1x4 


0.89(0.91) 


1.12(1.16) 


1.23(1.36) 


1.27(1.53) 


1x5 


0.73(0.75) 


0.92(0.95) 


1.01(1.12) 


1.04(1.25) 


2x2 


0.86(0.87) 


1.08(1.13) 


1.19(1.31) 


1.23(1.48) 


2x3 


0.59(0.61) 


0.74(0.77) 


0.80(0.89) 


0.82(1.01) 


2x4 


0.46 (0.48) 


0.57(0.59) 


0.62(0.69) 


0.64(0.78) 


2x5 


0.38(0.41) 


0.46(0.49) 


0.51(0.57) 


0.52(0.64) 


3x3 


0.41 (0.43) 


0.50(0.53) 


0.55(0.61) 


0.56(0.69) 


3x4 


0.33(0.35) 


0.39(0.42) 


0.43(0.48) 


0.44(0.54) 


3x5 


0.29(0.31) 


0.32(0.36) 


0.35(0.40) 


0.36(0.45) 


4x4 


0.27(0.30) 


0.30(0.34) 


0.33(0.38) 


0.34(0.43) 


4x5 


0.24(0.26) 


0.26(0.30) 


0.27(0.33) 


0.28(0.37) 


5x5 


0.21 (0.23) 


0.22(0.26) 


0.23(0.29) 


0.24(0.32) 


2x8 


0.28(0.30) 


0.31(0.35) 


0.33(0.39) 


0.35(0.43) 



Regularization parameter A to achieve specified output standard 
deviation when OGS is applied to a real standard normal signal: 
full convergence (25 iterations). 



the origin, <7 X (A) is approximated by 



m~ ^ r(|J-|/2 + i/2) 

g - (A) * - V2 T(\J\/2) A ' forASS °' 



(30) 



where \J~\ is the cardinality of the group (K in ID, K\Ki in 2D). This can be explained by noting that for 
y ~ A/"(0,<7 2 ), the £2 norm of the group follows a chi-distribution with \S\ degrees of freedom, the mean of 
which is the slope in (30). For small A, OGS has roughly the effect of soft thresholding the £2 norm of the 
groups. However, it should be noted that small A are of minor interest in practice because in this range, noise 
is not sufficiently attenuated. The right hand side of (30) is illustrated as dashed lines in Fig. 1. 

3.1 Shrinkage/Thresholding behavior 

Although not apparent in Fig. 1, the soft-thresholding and OGS procedures are quite distinct in their shrink- 
age/thresholding behavior. Clearly, if y is i.i.d. with y ~ A/"(0, 1) and if x is the result of soft thresholding 
(i.e. x = soft(y,T)), then x contains many zeros. All values \y\ < T are mapped to zero. The nonlincarity 
involves thresholding in this sense. In contrast, OGS does not produce any zeros unless A is sufficiently large. 
This behavior is illustrated in Fig. 2, which shows the pdf of y after soft thresholding (a) and after OGS (b, 
c). Soft thresholding with T — 1.38 produces an output x with cr x — 0.25. The pdf of x, illustrated in Fig. 2a, 
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Figure 2: Probability density function of zero-mean unit-variance Gaussian noise after soft thresholding (a) 
and OGS (b,c). In (a) and (b), the parameters (T and A) are set so that u x = 0.25. In (b) and (c) OGS was 
applied with group size 3x3. In (a) and (c) the pdf contains a point mass at zero. 

consists of a point mass at the origin of mass 0.831 and the tails of the Gaussian pdf translated toward the 
origin. The point mass represents the zero elements of x. 

Using OGS with group size 3x3 and A set so that again a x = 0.25, the output x does not contain zeros. 
The pdf is illustrated in Fig. 2b. The absence of the point mass at the origin reflects the fact that OGS 
mapped no values of y to zero, i.e. no thresholding. The pdf in Fig. 2b is computed numerically by applying 
the OGS algorithm to simulated standard normal data, as no explicit formula is available. 

When A is sufficiently large, then OGS does perform thresholding, as illustrated in Fig. 2c. For a group 
size of 3 x 3 and A = 0.4, the pdf exhibits a point-mass at the origin reflecting the many zeros in the output 
of OGS. 1 For this value of A, OGS performs both thresholding and shrinkage, like the soft threshold function. 



lr The pdf in Fig. 2c is computed as a histogram; hence, the exact value of the point-mass at the origin depends on the 
histogram bin width. 
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Table 3: Parameter A for standard complex normal i.i.d. signal 







Output 


std a x 




Group 


lO" 2 


10" 3 


10" 4 


10" 5 


1 x 1 


2.54 


3.26 


3.86 


4.39 


1x2 


1.30(1.33) 


1.65(1.71) 


1.82(2.00) 


1.89(2.25) 


1 x 3 


0.90 (0.93) 


1.12(1.17) 


1.22(1.35) 


1.26(1.52) 


1 x 4 


0.71(0.73) 


0.86(0.91) 


0.93(1.04) 


0.96(1.17) 


1 x 5 


0.60 (0.62) 


0.71 (0.75) 


0.76 (0.86) 


0.78(0.96) 


2x2 


0.66 (0.69) 


0.83(0.87) 


0.91(1.01) 


0.94(1.13) 


2x3 


0.48(0.51) 


0.56(0.60) 


0.61 (0.69) 


0.63(0.78) 


2x4 


0.39(0.42) 


0.44 (0.49) 


0.47(0.55) 


0.49(0.61) 


2x5 


0.34(0.37) 


0.37(0.42) 


0.39(0.47) 


0.41(0.53) 


3x3 


0.36 (0.39) 


0.40 (0.45) 


0.42(0.50) 


0.43 (0.56) 


3x4 


0.30(0.33) 


0.32(0.38) 


0.34(0.42) 


0.35 (0.47) 


3x5 


0.27(0.29) 


0.28(0.33) 


0.29(0.37) 


0.30(0.41) 


4x4 


0.26(0.28) 


0.27(0.32) 


0.28(0.36) 


0.29(0.40) 


4x5 


0.23(0.25) 


0.24(0.28) 


0.25(0.32) 


0.25(0.35) 


5x5 


0.20(0.22) 


0.21 (0.25) 


0.22(0.28) 


0.22(0.31) 


2x8 


0.26(0.28) 


0.27(0.32) 


0.28(0.36) 


0.29(0.40) 



3.2 Complex data 

The calculation (29) is for real- valued standard normal x. For complex- valued Gaussian data the formula is 
slightly different: 



\\v\ -T\ 2 p v {y)dy 

'\V\>T 

= cxp(-T 2 ) - 2y/^TQ(V2T) 



(31) 

(32) 



where p y (y) is the zero-mean unit- variance complex Gaussian pdf (standard complex normal), CAf(0, 1). In 
the complex case, (30) is modified to 



<7xW 



r(|j-| + i/2) 
r(|J|) 



A, for ArjO, 



(33) 



as the degrees of freedom of the chi-distribution is doubled (due to real and imaginary parts of complex 
y). Because complex- valued data is common (using the Fourier transform, STFT, and in radar and medical 
imaging, etc.), it is useful to address the selection of A in the complex case as well as in the real case. Note 
that Table 2 does not apply to the complex case. Table 3 gives the value of A for the complex case, analogous 
to Table 2 for the real case. 
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3.3 Remarks 

The preceding sections described how the parameter A may be chosen so as to reduce additive white Gaussian 
noise to a desired level. However, in many cases the noise is not white. For example, in the speech denoising 
example in Section 4.2, the OGS algorithm is applied directly in the STFT. However, the STFT is an 
overcomplete transform; therefore, the noise in the STFT domain will not be white, even if it is white in the 
original signal domain. In the speech denoising example in Section 4.2 below, this issue is ignored and the 
algorithm is still effective. However, in other cases, where the noise is more highly correlated, the values A in 
Table 2 and 3 may be somewhat inaccurate. 

The penalty function (4) is suitable for stationary noise; however, in many applications, noise is not 
stationary. For example, in the problem of denoising speech corrupted by stationary colored noise, the 
variance of the noise in the STFT domain will vary as a function of frequency. In particular, some noise 
components may be narrowband and therefore occupy a narrow time-frequency region. The OGS penalty 
function and algorithm, as described in this paper, do not apply to this problem directly. The penalty function 
(4) and the process to select A must be appropriately modified. 

The OGS algorithm as described above uses the same block size over the entire signal. In some applications, 
it may be more appropriate that the block size varies. For example, in speech denoising, as noted and 
developed in [72] , it is beneficial that the block size in the STFT domain varies as a function of frequency 
(e.g. for higher temporal resolution at higher frequency). In addition, the generalization of OGS so as to 
denoise wavelet coefficients on tree-structured models as in [61] may be of interest. 

4 Examples 

4.1 One-dimensional denoising 

As an illustrative example, the OGS algorithm is applied to denoising the one-dimensional group sparse signal 
in Fig. 3a. The noisy signal in Fig. 3b is obtained by adding independent white Gaussian noise with standard 
deviation a = 0.5. The dashed line in the figure indicates the '3c' level. The '3cr rule' states that nearly all 
values of a Gaussian random variable lie within three standard deviations of the mean (in fact, 99.7%). Hence, 
by using 3<7 as a threshold with the soft threshold function, nearly all the noise will be eliminated as illustrated 
in Figure 3c with threshold T — 3c = 1.5. Although the noise is effectively eliminated, the large- amplitude 
values of the signal have been attenuated, which is unavoidable when applying soft thresholding. 

Even though this choice of threshold value does not minimize the RMSE, it is a simple and intuitive 
procedure which can be effective and practical in certain applications. Moreover, this rule does not depend 
on the signal energy (only the noise variance), so it is straight-forward to apply. Regarding the RMSE, it 
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Figure 3: Group sparse signal denoising: comparison of soft thresholding and overlapping group shrinkage 
(OGS). OGS yields the smaller RMSE. 

should be noted that optimizing the RMSE docs not always lead to the most favorable denoising result in 
practice. For example, in speech enhancement /denoising, even a small amount of residual noise will be audible 
as 'musical noise' [44]. (In speech denoising via STFT thresholding, the RMSE-optimal threshold produces a 
denoised signal of low perceptual quality.) 

The result of applying the OGS algorithm to the noisy signal is illustrated in Fig. 3d. Twenty-five iterations 



18 




Figure 4: Convergence of OGS algorithm as applied in Fig. 3 . 

of the algorithm were used, with group size K = 5 and parameter 2 A = 0.68er = 0.34. As visible, the noise 
has been essentially eliminated. Compared to soft thresholding in Fig. 3c, the large-amplitude values of the 
original signal are less attenuated, and hence the RMSE is improved (0.52 compared to 0.82). In this example, 
the RMSE comparison between the two methods is meaningful because both methods have been used with 
parameter values chosen so as to eliminate essentially all the noise. 

The convergence of x( fe ) to x* is illustrated in Fig. 4, where it can be seen that numerous values x^ k '(i) 
converge to zero. In this example, with initialization x^(i) ^ 0, Vz G I, we have x^(i) ^ 0, Vi G I for all 
subsequent iterations fc > 1. Yet, entire groups converge to zero. 

The convergence behavior in terms of the cost function history of the OGS algorithm is compared with 
three other algorithms in Fig. 5, namely to ADMM [20], CDAD (coordinate descent algorithm for the dual 
problem) [3,40], and the algorithm of [73] as implemented in the SLEP (Sparse Learning with Efficient 
Projections) software package 3 . The figure shows the cost function history for the first 150 iterations of 
each algorithm. The OGS algorithm exhibits a monotone decreasing cost function that attains the lowest 
cost function value after 25 iterations. The ADMM, CDAD, and OGS algorithms have about the same 
computational cost per iteration, O(NK). We implemented these three algorithms in MATLAB. The SLEP 
software is written in C (compiled in MATLAB as a mex file) and each iteration of SLEP performs several 
inner optimization iterations. 

The run-time for 150 iterations of SLEP, CDAD, ADMM, and OGS are 3, 99.5, 36.7 and 7.7 milliseconds, 
respectively. Note that SLEP and OGS are much faster than the other two. SLEP is written in C/Mex, 



2 The parameter A is chosen so that 25 iterations of the OGS algorithm with group size K = 5 applied to white Gaussian 
noise produces the same output standard deviation as soft thresholding using threshold T = 3cr. This method for selecting A is 
elaborated in Section 3. 

3 The SLEP software was downloaded from http://www.public.asu.edu/-jye02/Software/SLEP/. We thank the author for 
verifying the correctness of our usage of SLEP. We also note that SLEP is a versatile suite of algorithms that can solve a variety 
of sparsity-related optimization problems. 
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Figure 5: Cost function history: comparison of algorithms for 1-D dcnoising. 

so it is difficult to compare its run-time with the other three algorithms which are written in MATLAB. 
The MATLAB implementation of OGS is fast due to (i) its minimal data indexing, (ii) its minimal memory 
usage (no auxiliary/splitting variables), (in) its computational work is dominated by convolution which is 
implemented very efficiently with the MATLAB built-in conv function. 

It should also be noted that ADMM requires that parameters be specified by the user, which we manually 
set so as to optimize its convergence behavior. The algorithm of [73] also calls for parameters, but the SLEP 
software provides default values which we used here. The CDAD and OGS algorithms do not require any 
user parameters to be specified. We also note that ADMM and CDAD require 5-times the memory of OGS, 
as the group size is K = 5 and the groups are fully overlapping. In summary, the OGS algorithm has 
minimal memory requirements, requires no user specified parameters, and has the most favorable asymptotic 
convergence behavior. 

We have found that for some problems OGS has a slow convergence during the initial iterations, com- 
pared to the other algorithms. This is because OGS does not perform explicit thresholding as do the other 
algorithms; instead, OGS gradually reduces values toward zero. It may be appropriate to perform several 
iterations of another algorithm or preliminary thresholding to improve initial convergence, and then switch 
to OGS. The comparison here uses OGS alone. 

Partially overlapping groups. It may be expected that partially overlapping groups may also serve as 
a useful signal model. To this end, we performed numerical experiments to evaluate the utility of partially 
overlapping group sparsity by modifying the penalty function (4) so that the outer sum is over a sub-grid of 
1. We used a set of signals like that of Fig. 3a where each group is systematically translated in turn, and 
we modified the OGS algorithm accordingly. For each signal, the RMSE-optimal A was numerically found, 
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Table 4: Average minimum RMSE for partial overlap (K = 5) 







overlap M 






a 





1 2 


3 


4 


0.5 
1 

2 


0.3925 
0.7282 
1.2135 


0.3978 0.3935 
0.7391 0.7344 
1.2303 1.2291 


0.3886 
0.7248 
1.2080 


0.3846 
0.7176 
1.1954 



and the corresponding optimal RMSE recorded for several values of the noise variance a 2 . Averaging over 
100 realizations for each group position and overlap, we obtain the RMSE values shown in Table 4.1. The 
fully-overlapping case (overlap M = 4) gives the lowest RMSE, as might be expected. However, the non- 
overlapping case (overlap M = 0) does not give the worst RMSE. It turns out that partial overlapping can 
yield an inferior RMSE compared to both fully-overlapping and non-overlapping cases. The reason for this is 
that, in the partial overlapping case, some components of x will be a member of only one group, while other 
components will be a member of two or more groups. The more groups a component is a member of, the more 
strongly it is penalized. Hence, components of x are non-uniformly penalized in the partially-overlapping case, 
and this degrades its performance when the group structure is not known a priori. 

4.2 Speech Denoising 

This section illustrates the application of overlapping group shrinkage (OGS) to the problem of speech en- 
hancement (denoising) [44]. The short-time Fourier transform (STFT) is the basis of many algorithms for 
speech enhancement, including classical spectrum subtraction [6,48], improved variations thereof using non- 
Gaussian models [35,47], and methods based on Markov models [21,26,71]. A well known problem arising 
in many speech enhancement algorithms is that the residual noise is audible as 'musical noise' [5]. Musical 
noise may be attributed to isolated noise peaks in the time-frequency domain that remain after processing. 
Methods to reduce musical noise include over estimating the noise variance, imposing a minimum spectral 
noise floor [5], and improving the estimation of model parameters [9]. Due to the musical noise phenomenon, 
it is desirable to reduce the noise sufficiently, even if doing so is not optimal with respect to the RMSE. 

To avoid isolated spurious time-frequency noise spikes (to avoid musical noise), the grouping/clustering 
behavior of STFT coefficients of speech waveforms can be taken into account. To this end, a recent algorithm 
by Yu et al. [72] for speech/audio enhancement consists of time-frequency block thresholding. We note that 
the algorithm of [72] is based on non-overlapping blocks. Like the algorithm of [72], the OGS algorithm 
aims to draw on the grouping behavior of STFT coefficients so as to improve the overall denoising result, 
but it uses a model based on fully overlapping blocks. Some other recent algorithms exploiting structured 
time-frequency sparsity of speech/audio are based on Markov models [21,26,71]. 
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Figure 6: Spectrograms of (a) noisy speech and (b) result of soft thresholding. Spectrograms denoised by 
(c) block thresholding [72] and (d) overlapping group shrinkage (OGS). Gray scale represents decibels. 

To apply OGS to speech denoising, we solve (2), where y = <I>s is the complex short-time Fourier transform 
(STFT) of the noisy speech s, $ is the STFT, x is the STFT coefficients to be determined, and i_(x) is (4). 
The estimated speech is then given by x = $*x. To minimize (2), we use the two-dimensional form of the 
OGS algorithm applied to the STFT coefficients. 

For illustration, consider the noisy speech illustrated in Fig. 6a. The noisy speech is a male utterance 
sampled at 16 kHz, corrupted by additive white Gaussian noise with SNR 10 dB. 4 The STFT is calculated 
with 50% overlapping blocks of length of 512 samples. Figure 6b illustrates the STFT obtained by soft 
thresholding the noisy STFT, with threshold T selected so as to reduce the noise standard deviation down 
to 0.1% of its value. From Table 3, we obtain T — 3.26cr, where a is the noise standard deviation in the 
STFT domain. The noise is sufficiently suppressed so that musical noise is not audible; however, the signal is 
somewhat distorted due to the relatively high threshold used. The spectrogram in Fig. 6b is perhaps overly 
thinned. 

Figure 6c illustrates the result of block thresholding [72] using the software provided by the authors. 5 



4 Speech file arctic_a0001.wav downloaded from http://www.speech.cs.cmu.edu/cmu_arctic/cmu_us_bdl_arctic/wav. 
5 http : //www . cmap . polytechnique . f r/~yu/research/ABT/samples . html 
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The SNR is 15.35 dB. It can be seen that BT produces blocking artifacts in the spectrogram. Some auditory 
artifacts are perceptible in the BT denoised speech. 

Figure 6d illustrates the result of overlapping group shrinkage (OGS) applied to the noisy STFT. We 
used 25 iterations of the OGS algorithm. Based on listening to speech signals denoised with various group 
sizes, we selected a group size 8x2 (i.e. eight frequency bins x two time bins). Other group sizes may be 
more appropriate for other sampling rates and STFT block lengths. As in the soft thresholding experiment, 
the parameter A was selected so as to reduce the noise standard deviation down to 0.1% of its value. From 
Table 3, we obtain A = 0.32o\ The SNR of the denoised speech is 13.77 dB. While the SNR is lower than 
block thresholding, the artifacts are less audible. As for soft-thresholding with A selected likewise, musical 
noise is not audible in the resulting waveform obtained by applying the inverse STFT. 

It was found in [72] that empirical Wiener post-processing (EWP), introduced in [32], improves the result of 
the block thresholding (BT) algorithm. This post-processing, which is computationally very simple, improves 
the result of OGS by an even greater degree than for BT, as measured by SNR improvement. The Wiener 
post-processing raises the SNR for BT from 15.35 dB to 15.75 dB, while it raises the SNR for OGS from 
13.77 dB to 15.63 dB. Hence, the two methods give almost the same SNR after Wiener post-processing. The 
substantial SNR improvement in the case of OGS can be explained as follows: The OGS algorithm has the 
effect of slightly shrinking (attenuating) large coefficients which produces a bias and negatively affects the 
SNR of the denoised signal. The Wiener post-processing procedure largely corrects that bias; it has the effect 
of rescaling (slightly amplifying) the large coefficients appropriately. 

Signal-domain data fidelity. Note that, due to the STFT not being an orthonormal transform, the noise 
in the STFT domain is not white, even when it is white in the signal domain. Therefore, instead of problem 
(2), it is reasonable to solve 

min l||s-$*x||l + Ai?(x), (34) 

x I 

which is consistent with the white noise assumption (the data fidelity term is in the signal domain) . Problem 
(34) can be solved by proximal methods (e.g. forward-backward splitting (FBS)) [4,15,16] or ADMM methods 
[7,20,27]. 

We note that the solving (34) takes more computation than solving (2). Formulation (2) is, by definition, 
a single proximal operator (which OGS implements). On the other hand, solving (34) (using ADMM or FBS, 
etc.) requires the repeated application of the proximal operator, interlaced with $ and $* (forward and 
inverse STFTs). In our experiment, solving (34) using ADMM took 5.95 seconds, while solving (2) took 0.25 
seconds. We ran 25 iterations of OGS in each ADMM iteration to implement the proximal operator. 

Figure 7 shows the RMSE between the estimated speech x and the original speech as a function of A for 
each of (2) and (34). Note that (34) attains a better RMSE (higher SNR) than (2), but the difference is 
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Figure 7: Comparison of RMSE using STFT-domain cost function (2) and signal-domain cost function (34). 
Formulation (34) gives a slightly better RMSE. The proposed value (A = 0.32<r) is indicated by the arrow. 

marginal and imperceptible. We also note that, for both formulations (2) and (34), when A is optimized so 
as to obtain the minimum RMSE (best SNR), 'musical noise' is clearly audible in the denoised speech. In 
terms of perceptual quality, a higher value of A gives a much better result. Our proposed method for setting 
A, indicated by the arrow in Fig. 7, gives a higher RMSE, but the result is perceptually preferable due to 
the suppression of 'musical noise'. Note that the increase in RMSE, due to using a higher A so as to avoid 
musical noise, outweighs the increase in RMSE due to using (2) instead of (34). Due to the imperceptible 
difference between (2) and (34), and the higher computational complexity of the latter, the formulation (2) 
appears suitable for speech denoising using OGS. 

Further comparisons. This speech denoising example is intended as an illustrative example of OGS rather 
than state-of-the-art speech denoising. However, to provide further context, the output SNR of several 
algorithms, with and without empirical Wiener post-processing (EWP), are summarized in Table 5. The 
additional algorithms are: spectral subtraction (SS) [5], the log MMSE algorithm (LMA) [14], the subspace 
algorithm (SUB) [36], and persistent shrinkage (PS) [68]. For SS, LMA, and SUB, we used the MATLAB 
software provided in Ref. [44]. For PS, we used the software provided by the authors. 6 Without EWP, BT 
achieves the highest SNR of 15.35 dB. However, the BT has slightly audible artifacts as noted above, as does 
PS. The artifacts of SUB and OGS are less audible. The artifacts of SS and LMA are clearly audible. 

EWP improves the SNR of SUB and LMA by 1.14 dB and 1.83 dB; however, perceptual artifacts are still 
clearly audible. EWP improves BT by 0.4 dB, and has a minor impact on perceptual quality. EWP improves 
SUB, PS and OGS by 2.09 dB, 2.14 dB, and 1.86 dB, and slightly reduces the audible artifacts. We note that 
the form of EWP used in PS is a persistent form introduced in [68]. With EWP, four methods (BT, SUB, 
PS, and OGS) achieve almost the same SNR (with varying degrees of audible musical noise). However, SUB 
has a high computational complexity due to eigenvalue factorization. 



6 http : //homepage . univie . ac . at/monika . doerf ler/StrucAudio . html 
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Table 5: Output SNR of several speech enhancement algorithms, with and without empirical Wiener post- 
processing (EWP). 

SNR (dB) 

Method w/o EWP EWP 

spectral subtract. (SS) [5] 13.50 14.64 

log MMSE alg. (LMA) [14] 13.10 14.93 

subspace alg. (SUB) [36] 13.68 15.77 

block-thresh. (BT) [72] 15.35 15.75 

persistent shrink. (PS) [68] 13.63 15.77 

OGS (proposed) 13.77 15.63 

5 Conclusion 

This paper introduces a computationally efficient algorithm for denoising signals with group sparsity struc- 
ture. 7 In this approach, 'overlapping group shrinkage' or OGS, the groups are fully overlapping and the 
algorithm is translation-invariant. The method is based on the minimization of a convex function. 

A procedure is described for the selection of the regularization parameter A. The procedure is based on 
attenuating the noise to a specified level without regard to the statistical properties of the signal of interest. 
In this sense, the procedure for setting A is not Bayesian; it does not depend on the signal to be estimated. 
Even though the described procedure for setting A is conceptually simple, it does not admit the use of explicit 
formulas for A because, in part, the OGS function does not itself have an explicit formula (it being defined 
as the solution to a minimization problem). The procedure to set A is based on analyzing the output of 
OGS when it is applied to an i.i.d. standard normal signal, and can be implemented in practice by off-line 
computation of tables such as Tables 2 and 3. Adopting recent approaches for regularization parameter 
selection [19] provides another potential approach to be investigated. 

The paper illustrates the use of overlapping group shrinkage for speech denoising. The OGS algorithm 
is applied to the STFT of the noisy signal. Compared to the block thresholding algorithm [72], the OGS 
algorithm gives similar SNR when Wiener post-processing is used. However, the OGS denoised speech has 
fewer perceptual artifacts. 

Current work includes the investigation of the group-sparsity promoting penalty function proposed in 
[38,52], namely, the derivation of a simple quadratic MM algorithm implementing the proximal operator, 
the development of a procedure for setting the regularization parameter A along the lines considered here, 
comparison of convergence behavior with other algorithms, and an evaluation for speech denoising. 



7 MATLAB software to implement the OGS algorithm and to reproduce the figures in the paper are online at http://eeweb. 
poly.edu/iselesni/ogs/. 
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Supplementary Material 

MATLAB program for the implementation of overlapping group shrinkage (ID). 



function [a, cost] = ogshrink(y, K, lam, Nit) 

"/, [a, cost] = ogshrink(y, K, lam, Nit); 

7, Overlapping group shrinkage (OGS) 

7, Minimizes the cost function with respect to a 

"/. 

"/, cost = 0.5 * sum(abs(y - a). ~2) + lam * sum(sqrt (conv(abs(a) . "2, ones(l ,K) )) ) ; 

% 

INPUT 

1-D noisy signal (vector) 

size of group 

regularization parameter 

number of iterations 



•/. 


INPUT 


7. 


y 


7 


K 


% 


lam 


7. 


Nit 


7. 




7. 


OUTPUT 


7. 


a 


7. 


cost 



output (denoised signal) 
cost function history 

7. Po-Yu Chen and Ivan Selesnick 

7. Polytechnic Institute of New York University 

7. New York, USA 

7. March 2012 

a = y; 7* initialize 

h = ones(l,K); 7. for convolution 

cost = zeros(l.Nit) ; 

i = (a -= 0); 

for it = l:Nit 

r = sqrt(conv(abs(a) . ~2, h) ) ; 

cost(it) = 0.5*sum(abs(y - a) . "2) + lam * sum(r) ; 

v = 1 + lam*conv(l ./r, h) ; 

v = v(K:end+l-K); 

'/, In newer MATLAB, the above 2 lines can be replaced with 1 line: 

'/. v = 1 + lam*conv(l./r, h, 'valid'); 

a = y./v; 
end 
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MATLAB program for the implementation of overlapping group shrinkage (2D). 



function [a, cost] = ogshrink2(y, Kl, K2, lam, Wit) 

'/, [a, cost] = ogshrink2(y, Kl, K2, lam, Nit); 

'/, 2D overlapping group shrinkage (OGS) 

'/, Minimizes the cost function with respect to a 

% 

"/, cost = 0.5*sum(sum(abs(y - a).~2)) + lam * sum(sqrt (conv(abs(a) . "2, ones(Kl,K2) ) ) ) ; 

% 

INPUT 

2-D noisy signal (2D array) 

size of group 

regularization parameter 

number of iterations 



•/. 


INPUT 


7. 


y 


•/. 


Kl, K2 


% 


lam 


7. 


Nit 


% 




7. 


OUTPUT 


7. 


a : 


7. 


cost : 



output (denoised signal) 
cost function history 

7. Po-Yu Chen and Ivan Selesnick 

7. Polytechnic Institute of New York University 

7, New York, USA 

7. March 2012 

a = y; 7* initialize 

hi = ones(Kl,l); 7. for convolution 

h2 = ones(K2,l); 7. for convolution 

cost = zeros(l,Nit) ; 

for it = l:Nit 

r = sqrt(conv2(hl, h2, abs(a).~2)); 

cost(it) = 0.5*sum(sum(abs(y - a). "2)) + lam * sum(r( : )) ; 

v = 1 + lam*conv2(hl, h2, l./r); 

v = v(Kl:end+l-Kl, K2 :end+l-K2) ; 

'/, In newer MATLAB, the above 2 lines can be replaced with 1 line: 

% v = 1 + lam*conv2(hl, h2, l./r, 'valid'); 

a = y./v; 
end 
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